Modeling prosodic sequences with k-means and Dirichlet process GMMs
Abstract
In this paper we describe two unsupervised representations of prosodic sequences based on k-means and Dirichlet Process Gaussian Mixture Model (DPGMM) clustering. The clustering algorithms are used to infer an inventory of prosodic categories over automatically segmented syllables. A trigram model is trained over the resulting sequences of cluster labels to characterize speech. We find that DPGMM clusters show a greater correspondence with manual ToBI labels than k-means clusters. However, sequence models trained on k-means clusters significantly outperform those trained on DPGMM clusters in classifying speaking style, nativeness, and speakers. We also investigate the use of these sequence models for detecting outliers in these three tasks. Non-parametric Bayesian techniques have the advantage of being able to learn a clustering solution and infer the number of clusters directly from data. While it is attractive to avoid specifying k before clustering, on the tasks of characterizing prosodic sequences we find that effective use of DPGMMs still requires a significant amount of parameter tuning, and performance fails to reach the level of k-means.
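As a rough illustration of the sequence-modeling step, the sketch below trains one trigram model per class over sequences of cluster labels and classifies a new sequence by log-likelihood. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify smoothing, padding symbols, or the decision rule, so the add-one smoothing, the `<s>`/`</s>` markers, the toy label sequences, the class names, and the function names `train_trigram`, `log_likelihood`, and `classify` are all hypothetical. The integer labels stand in for cluster IDs that k-means or a DPGMM would assign to each syllable.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical padding symbols, not from the paper


def pad(seq):
    """Add two start markers and one end marker around a label sequence."""
    return [START, START] + list(seq) + [END]


def train_trigram(sequences, num_clusters):
    """Count trigrams and their two-label contexts over cluster-label sequences."""
    trigrams, contexts = Counter(), Counter()
    for seq in sequences:
        p = pad(seq)
        for a, b, c in zip(p, p[1:], p[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    # Predictable symbols: every cluster label plus the end marker.
    return trigrams, contexts, num_clusters + 1


def log_likelihood(model, seq):
    """Log-probability of a sequence under add-one smoothing (an assumption)."""
    trigrams, contexts, vocab = model
    p = pad(seq)
    return sum(
        math.log((trigrams[(a, b, c)] + 1) / (contexts[(a, b)] + vocab))
        for a, b, c in zip(p, p[1:], p[2:])
    )


def classify(models, seq):
    """Pick the class whose trigram model assigns the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(models[name], seq))


# Toy data: integers play the role of per-syllable cluster IDs.
models = {
    "read": train_trigram([[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0]], num_clusters=3),
    "spontaneous": train_trigram([[2, 2, 1, 2, 2, 1], [2, 1, 2, 2, 1, 2]], num_clusters=3),
}

print(classify(models, [0, 1, 0, 1]))
print(classify(models, [2, 2, 1, 2]))
```

Under the same assumptions, a length-normalized log-likelihood could serve as a simple outlier score, with unusually low values flagging sequences that fit none of the class models well.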
Similar resources
Pitch-dependent GMMs for text-independent speaker recognition systems
Gaussian mixture models (GMMs) and ergodic hidden Markov models (HMMs) have been successfully applied to model short-term acoustic vectors for speaker recognition systems. Prosodic features are known to carry information concerning the speaker’s identity and they can be combined with the short-term acoustic vectors in order to increase the performance of the speaker recognition system. In this ...
Learning When to Listen: Detecting System-Addressed Speech in Human-Human-Computer Dialog
New challenges arise for addressee detection when multiple people interact jointly with a spoken dialog system using unconstrained natural language. We study the problem of discriminating computer-directed from human-directed speech in a new corpus of human-human-computer (H-H-C) dialog, using lexical and prosodic features. The prosodic features use no word, context, or speaker information. Res...
Effects of compatible versus competing rhythmic grouping on errors and timing variability in speech
In typical speech words are grouped into prosodic constituents. This study investigates how such grouping interacts with segmental sequencing patterns in the production of repetitive word sequences. We experimentally manipulated grouping behavior using a rhythmic repetition task to elicit speech for perceptual and acoustic analysis to test the hypothesis that prosodic structure and patterns of ...
Revisiting k-means: New Algorithms via Bayesian Nonparametrics
Bayesian models offer great flexibility for clustering applications—Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Ba...